<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
           "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<!--
TO DO:
enc2native
enc2utf8
stri_enc_toascii
-->



   
<head>
<style type="text/css">
.knitr.inline {
  background-color: #f7f7f7;
  border:solid 1px #B0B0B0;
}
.error {
	font-weight: bold;
	color: #FF0000;
},
.warning {
	font-weight: bold;
}
.message {
	font-style: italic;
}
.source, .output, .warning, .error, .message {
	padding: 0em 1em;
  border:solid 1px #F7F7F7;
}
.source {
  background-color: #f5f5f5;
}
.rimage.left {
  text-align: left;
}
.rimage.right {
  text-align: right;
}
.rimage.center {
  text-align: center;
}
.hl.num {
  color: #AF0F91;
}
.hl.str {
  color: #317ECC;
}
.hl.com {
  color: #AD95AF;
  font-style: italic;
}
.hl.opt {
  color: #000000;
}
.hl.std {
  color: #585858;
}
.hl.kwa {
  color: #295F94;
  font-weight: bold;
}
.hl.kwb {
  color: #B05A65;
}
.hl.kwc {
  color: #55aa55;
}
.hl.kwd {
  color: #BC5A65;
  font-weight: bold;
}
</style>
<title>stringi &ndash; Compatibility Tables</title>

   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<!--    <meta charset='UTF-8' /> -->
   <meta name="Author" content="Marek Gągolewski" />
   <meta http-equiv="Content-Language" content="en-US" />
   
   <meta name="Keywords" content="Rexamine, stringi, ICU, R" />
   <meta name="Description" content="stringi Compatibility Tables" />
   <meta name="Robots" content="index, follow" />
   
<style>
body {
   font-family: "Segoe UI";
   font-size: 14px;
}

h1 {
   font-size: 40px;
   font-weight: 500;
   background: white;
   color: black;
   padding: 5px;
   border-bottom: 3px solid black;
}

h2 {
   font-size: 25px;
   border-bottom: 1px solid gray;
   font-weight: 500;
   background: #d0d0d0;
   padding: 5px;
}

h3 {
   font-size: 20px;
   border-bottom: 1px solid #fafafa;
   font-weight: 500;
   background: #f0f0f0;
   padding: 5px;
}


h4 {
   font-size: 15px;
   font-weight: 500;
   background: #eaeaff;
   padding: 3px;
}

th {
   text-align: left;
   font-size: 10pt;
   padding: 3px;
}

table {
   border: 1px black solid;
}

td {
   border: 1px gray solid;
   padding: 3px;
}

div.columntitle {
   margin-bottom: 1ex;
   font-weight: bold;
   text-align: center;
}

div.column1, div.column2 {
   width: 30%;
   clear: none;
   border: 0;
   margin: 0;
   padding: 1ex;
   display: inline-block;
   height: auto;
   vertical-align: top;
}

div.column3 {
   width: 30%;
   clear: bottom;
   border: 0;
   margin: 0;
   padding: 1ex;
   display: inline-block;
   height: auto;
   vertical-align: top;
}

div.footer {
   margin: 0;
   padding: 1ex;
   background: #f0f0f0;
   font-size: 90%;
}


</style>
</head>

<body>



<body>

<h1>
<a href="http://stringi.rexamine.com">stringi</a> 0.2-2  (2014-04-19 ) Compatibility Tables:  Character Encodings

</h1>

<!-- -------------------------------------------------------------- -->

<h2>Introduction</h2>

<p>As we know, character vectors in R in computer's memory
resemble lists of raw vectors (each ended up with a 00 byte).
Each string has to be properly "decoded"
so that textual information may be read from such a byte stream.
This "decoding scheme" is simply called a character encoding.</p>

<p>In other words, data in computer's memory are just bytes (small integer values) 
&ndash; an encoding is a way to represent characters with such numbers,
it is a semantic "key" to understand a given byte sequence.
For example, in ISO-8859-2 (Central European), the value 177 represents Polish
“a with ogonek”, and in ISO-8859-1 (Western European),
the same value meas the “plus-minus” sign.
Thus, a character encoding is a translation scheme.</p>

<p>Below you will find a list of functions that deal with character encodings.
Mostly, you will use them when reading or writing a text file.
Functions in <a href="http://stringi.rexamine.com">stringi</a>
process each string internally in Unicode,
which is a superset of all character representation schemes.
This is why while working with <a href="http://stringi.rexamine.com">stringi</a>
you will often use (sometimes without even knowing that explicitly) the following
workflow scheme: READ FILE -> CONVERT TO UTF-8 -> PROCESS -> CONVERT BACK TO
DESIRED ENCODING -> WRITE FILE.</p>

TODO: add stri_enc_toutf8

<!-- -------------------------------------------------------------- -->

<h2>Conversion to Raw Vectors</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>
<code><code class="knitr inline">charToRaw()</code></code> &ndash; single string to a raw vector only

<div class="chunk" id="unnamed-chunk-4"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">charToRaw</span><span class="hl std">(</span><span class="hl str">&quot;aA1&quot;</span><span class="hl std">)</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 61 41 31
</pre></div>
</div></div>


</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>

<code><code class="knitr inline">stri_encode()</code></code>
with argument <code><code class="knitr inline">to_raw=TRUE</code></code>
   is vectorized over the first argument;
   it returns a list of raw vectors.

<div class="chunk" id="unnamed-chunk-5"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">stri_encode</span><span class="hl std">(</span><span class="hl str">&quot;aA1&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl kwc">to_raw</span><span class="hl std">=</span><span class="hl num">TRUE</span><span class="hl std">)[[</span><span class="hl num">1</span><span class="hl std">]]</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 61 41 31
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">stri_encode</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl str">&quot;aA1&quot;</span><span class="hl std">,</span> <span class="hl str">&quot; &quot;</span><span class="hl std">),</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl kwc">to_raw</span><span class="hl std">=</span><span class="hl num">TRUE</span><span class="hl std">)</span>
</pre></div>
<div class="output"><pre class="knitr r">## [[1]]
## [1] 61 41 31
## 
## [[2]]
## [1] 20
</pre></div>
</div></div>

</div>
</div>

<h3>Performance comparison</h3>

<div class="chunk" id="unnamed-chunk-6"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl std">test1</span> <span class="hl kwb">&lt;-</span> <span class="hl str">&quot;abcdefghijklmnopqrstuvwxyz&quot;</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">charToRaw</span><span class="hl std">(test1),</span> <span class="hl kwd">stri_encode</span><span class="hl std">(test1,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl kwc">to_raw</span><span class="hl std">=</span><span class="hl num">TRUE</span><span class="hl std">)[[</span><span class="hl num">1</span><span class="hl std">]])</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                                            expr    min     lq  median      uq    max neval
##                                charToRaw(test1)  1.018  1.575  2.2325  2.5135 12.558   100
##  stri_encode(test1, "", "", to_raw = TRUE)[[1]] 11.782 12.847 13.4095 13.9705 65.901   100
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl std">test2</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">rep</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl str">&quot;abcdefghijklmnopqrstuvwxyz&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;ABCDEFGHIJKLMNOPQRSTUVWXYZ&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;0123456789&quot;</span><span class="hl std">),</span> <span class="hl num">10</span><span class="hl std">)</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">lapply</span><span class="hl std">(test2, charToRaw),</span> <span class="hl kwd">stri_encode</span><span class="hl std">(test2,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">,</span> <span class="hl kwc">to_raw</span><span class="hl std">=</span><span class="hl num">TRUE</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                                       expr    min      lq  median      uq     max neval
##                   lapply(test2, charToRaw) 27.857 32.1365 35.0335 36.5075  89.783   100
##  stri_encode(test2, "", "", to_raw = TRUE) 20.489 21.7905 22.7170 23.9065 100.140   100
</pre></div>
</div></div>


<!-- -------------------------------------------------------------- -->

<h2>Conversion from Raw Vectors</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>
<code><code class="knitr inline">rawToChar()</code></code> 
&ndash; single raw vector to a single string only

<div class="chunk" id="unnamed-chunk-7"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">rawToChar</span><span class="hl std">(</span><span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">97</span><span class="hl std">,</span> <span class="hl num">65</span><span class="hl std">,</span> <span class="hl num">49</span><span class="hl std">)))</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "aA1"
</pre></div>
</div></div>


</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_encode()</code></code>  also accepts a raw vector
   or a list of raw vectors as its first argument;
   by default, i.e. when <code><code class="knitr inline">to_raw=FALSE</code></code>,
   the result is a character vector.

<div class="chunk" id="unnamed-chunk-8"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">stri_encode</span><span class="hl std">(</span><span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">97</span><span class="hl std">,</span> <span class="hl num">65</span><span class="hl std">,</span> <span class="hl num">49</span><span class="hl std">)),</span> <span class="hl str">&quot;&quot;</span><span class="hl std">)</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "61" "41" "31"
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">stri_encode</span><span class="hl std">(</span><span class="hl kwd">list</span><span class="hl std">(</span><span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">97</span><span class="hl std">,</span> <span class="hl num">65</span><span class="hl std">,</span> <span class="hl num">49</span><span class="hl std">)),</span>
   <span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl num">32</span><span class="hl std">)),</span> <span class="hl str">&quot;&quot;</span><span class="hl std">)</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "aA1" " "
</pre></div>
</div></div>

</div>
</div>

<h3>Performance comparison</h3>

<div class="chunk" id="unnamed-chunk-9"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl std">test1</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl num">97</span><span class="hl opt">:</span><span class="hl num">122</span><span class="hl std">)</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">rawToChar</span><span class="hl std">(test1),</span> <span class="hl kwd">stri_encode</span><span class="hl std">(test1,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: nanoseconds
##                    expr   min      lq  median      uq    max neval
##        rawToChar(test1)   892  1332.5  1534.0  1815.5   9120   100
##  stri_encode(test1, "") 18978 19507.5 19920.5 20382.0 138394   100
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl std">test2</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">rep</span><span class="hl std">(</span><span class="hl kwd">list</span><span class="hl std">(</span><span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl num">97</span><span class="hl opt">:</span><span class="hl num">122</span><span class="hl std">),</span> <span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl num">65</span><span class="hl opt">:</span><span class="hl num">90</span><span class="hl std">),</span> <span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl num">48</span><span class="hl opt">:</span><span class="hl num">57</span><span class="hl std">)),</span> <span class="hl num">10</span><span class="hl std">)</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">lapply</span><span class="hl std">(test2, rawToChar),</span> <span class="hl kwd">stri_encode</span><span class="hl std">(test2,</span> <span class="hl str">&quot;&quot;</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                      expr    min     lq median     uq    max neval
##  lapply(test2, rawToChar) 39.063 43.325 45.785 47.876 93.074   100
##    stri_encode(test2, "") 19.470 20.394 20.841 21.780 56.470   100
</pre></div>
</div></div>


<!-- -------------------------------------------------------------- -->

<h2>Conversion to Integer Vectors (i.e. UTF-32)</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>
   <code><code class="knitr inline">utf8ToInt()</code></code> 
   &ndash; single string in UTF-8 to an integer vector only

<div class="chunk" id="unnamed-chunk-10"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">utf8ToInt</span><span class="hl std">(</span><span class="hl kwd">enc2utf8</span><span class="hl std">(</span><span class="hl str">&quot;aA1&quot;</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 97 65 49
</pre></div>
</div></div>


</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_enc_toutf32()</code></code>  accepts a character vector on input
   and returns a list of integer vectors;
   like in all other functions from our package, native and UTF-8
   encodings are handled properly

<div class="chunk" id="unnamed-chunk-11"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">stri_enc_toutf32</span><span class="hl std">(</span><span class="hl str">&quot;aA1&quot;</span><span class="hl std">)[[</span><span class="hl num">1</span><span class="hl std">]]</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 97 65 49
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">stri_enc_toutf32</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl str">&quot;aA1&quot;</span><span class="hl std">,</span> <span class="hl str">&quot; &quot;</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## [[1]]
## [1] 97 65 49
## 
## [[2]]
## [1] 32
</pre></div>
</div></div>


</div>
</div>


<h3>Performance comparison</h3>

<div class="chunk" id="unnamed-chunk-12"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl std">test1</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">enc2utf8</span><span class="hl std">(</span><span class="hl str">&quot;abcdefghijklmnopqrstuvwxyz&quot;</span><span class="hl std">)</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">utf8ToInt</span><span class="hl std">(test1),</span> <span class="hl kwd">stri_enc_toutf32</span><span class="hl std">(test1)[[</span><span class="hl num">1</span><span class="hl std">]])</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: nanoseconds
##                          expr  min     lq median     uq   max neval
##              utf8ToInt(test1)  667  727.0  906.5 1043.0  3818   100
##  stri_enc_toutf32(test1)[[1]] 2839 2995.5 3123.5 3247.5 42244   100
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl std">test2</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">enc2utf8</span><span class="hl std">(</span><span class="hl kwd">rep</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl str">&quot;abcdefghijklmnopqrstuvwxyz&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;ABCDEFGHIJKLMNOPQRSTUVWXYZ&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;0123456789&quot;</span><span class="hl std">),</span> <span class="hl num">10</span><span class="hl std">))</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">lapply</span><span class="hl std">(test2, utf8ToInt),</span> <span class="hl kwd">stri_enc_toutf32</span><span class="hl std">(test2))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                      expr    min      lq  median      uq    max neval
##  lapply(test2, utf8ToInt) 32.316 35.0260 36.7355 37.9725 62.752   100
##   stri_enc_toutf32(test2)  7.575  8.5445  9.0055  9.6810 63.978   100
</pre></div>
</div></div>


<!-- -------------------------------------------------------------- -->

<h2>Conversion from Integer Vectors (i.e. UTF-32)</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>
   <code><code class="knitr inline">intToUtf8()</code></code> 
   &ndash; single integer vector to a single string only

<div class="chunk" id="unnamed-chunk-13"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">intToUtf8</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">97L</span><span class="hl std">,</span> <span class="hl num">65L</span><span class="hl std">,</span> <span class="hl num">49L</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "aA1"
</pre></div>
</div></div>


</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_enc_fromutf32()</code></code> 
   a single integer vector
   or a list of integer vectors as its argument;
   the result is a UTF-8-encoded character vector.

<div class="chunk" id="unnamed-chunk-14"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">stri_enc_fromutf32</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">97L</span><span class="hl std">,</span> <span class="hl num">65L</span><span class="hl std">,</span> <span class="hl num">49L</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "aA1"
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">stri_enc_fromutf32</span><span class="hl std">(</span><span class="hl kwd">list</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">97L</span><span class="hl std">,</span> <span class="hl num">65L</span><span class="hl std">,</span> <span class="hl num">49L</span><span class="hl std">),</span> <span class="hl num">32L</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "aA1" " "
</pre></div>
</div></div>


</div>
</div>


<h3>Performance comparison</h3>

<div class="chunk" id="unnamed-chunk-15"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl std">test1</span> <span class="hl kwb">&lt;-</span> <span class="hl num">97</span><span class="hl opt">:</span><span class="hl num">122</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">intToUtf8</span><span class="hl std">(test1),</span> <span class="hl kwd">stri_enc_fromutf32</span><span class="hl std">(test1))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                       expr   min     lq median     uq    max neval
##           intToUtf8(test1) 1.320 1.4495 1.5295 1.6165  6.733   100
##  stri_enc_fromutf32(test1) 2.331 2.3695 2.4615 2.5525 31.403   100
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl std">test2</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">rep</span><span class="hl std">(</span><span class="hl kwd">list</span><span class="hl std">(</span><span class="hl num">97</span><span class="hl opt">:</span><span class="hl num">122</span><span class="hl std">,</span> <span class="hl num">65</span><span class="hl opt">:</span><span class="hl num">90</span><span class="hl std">,</span> <span class="hl num">48</span><span class="hl opt">:</span><span class="hl num">57</span><span class="hl std">),</span> <span class="hl num">10</span><span class="hl std">)</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">lapply</span><span class="hl std">(test2, intToUtf8),</span> <span class="hl kwd">stri_enc_fromutf32</span><span class="hl std">(test2))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                       expr    min      lq  median     uq    max neval
##   lapply(test2, intToUtf8) 50.286 52.9190 56.1345 57.717 84.865   100
##  stri_enc_fromutf32(test2)  8.274  8.8315  9.2000  9.448 30.225   100
</pre></div>
</div></div>



<!-- -------------------------------------------------------------- -->

<h2>List of Supported Encodings</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>
   <code><code class="knitr inline">iconvlist()</code></code> 
   &ndash; returns a character vector with supported encoding
   names (as well as its aliases).
   
   <p>Note that, as R manual states, the names are rarely valid
   across all platforms.</p>

<div class="chunk" id="unnamed-chunk-16"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">sample</span><span class="hl std">(</span><span class="hl kwd">iconvlist</span><span class="hl std">(),</span> <span class="hl num">4</span><span class="hl std">)</span> <span class="hl com"># a sample of supported encodings</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "CSIBM9448"        "IBM1143"          "ISO-IR-8-1"       "ISO_8859-14:1998"
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">length</span><span class="hl std">(</span><span class="hl kwd">iconvlist</span><span class="hl std">())</span> <span class="hl com"># count; Fedora Linux 19 x64_86</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 1168
</pre></div>
</div></div>


</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_enc_list()</code></code> 
   with argument <code></code>
   provides a character vector with all supported encodings and
   their aliases in many different forms.
   
   <p>By default, howewer, a list of character vectors
   is returned. Each list element contains the list of aliases for 
   the given encoding.</p>
   
   <p>Please, note that apart from given encodings,
   ICU tries to normalize encoding specifiers, e.g. "utf8" is a valid
   specifier for "UTF-8".</p>
   
   <p>Depending on the version of the ICU library used,
   each encoding should be supported across all platforms.</p>
   
   <p>By the way, <code><code class="knitr inline">stri_enc_info()</code></code>
   returns detailed information of a given encoding specifier.</p>

<div class="chunk" id="unnamed-chunk-17"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">sample</span><span class="hl std">(</span><span class="hl kwd">stri_enc_list</span><span class="hl std">(</span><span class="hl num">TRUE</span><span class="hl std">),</span> <span class="hl num">4</span><span class="hl std">)</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] "windows-1256"             "windows-1250"             "ibm-1235"                 "ibm-949_VASCII_VSUB_VPUA"
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">length</span><span class="hl std">(</span><span class="hl kwd">stri_enc_list</span><span class="hl std">(</span><span class="hl num">TRUE</span><span class="hl std">))</span> <span class="hl com"># includes aliases</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 1198
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">length</span><span class="hl std">(</span><span class="hl kwd">stri_enc_list</span><span class="hl std">())</span> <span class="hl com"># true number of supported encodings</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 229
</pre></div>
<div class="source"><pre class="knitr r"><span class="hl kwd">str</span><span class="hl std">(</span><span class="hl kwd">stri_enc_info</span><span class="hl std">(</span><span class="hl str">&quot;cp1250&quot;</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## List of 13
##  $ Name.friendly: chr "windows-1250"
##  $ Name.ICU     : chr "ibm-5346_P100-1998"
##  $ Name.UTR22   : chr "ibm-5346_P100-1998"
##  $ Name.IBM     : chr "ibm-5346"
##  $ Name.WINDOWS : chr "windows-1250"
##  $ Name.JAVA    : chr "windows-1250"
##  $ Name.IANA    : chr "windows-1250"
##  $ Name.MIME    : chr NA
##  $ ASCII.subset : logi TRUE
##  $ Unicode.1to1 : logi TRUE
##  $ CharSize.8bit: logi TRUE
##  $ CharSize.min : int 1
##  $ CharSize.max : int 1
</pre></div>
</div></div>


</div>
</div>


<!-- -------------------------------------------------------------- -->

<h2>Convert Strings Between Encodings</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>
   <code><code class="knitr inline">iconv()</code></code> 
   &ndash; converts a character vector between two given encodings.
   Argument <code><code class="knitr inline">from</code></code>
   or <code><code class="knitr inline">to</code></code> equal to ""
   denotes default (native) encoding,
   which is used by R session.

<div class="chunk" id="unnamed-chunk-18"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">utf8ToInt</span><span class="hl std">(</span>
   <span class="hl kwd">iconv</span><span class="hl std">(</span><span class="hl kwd">rawToChar</span><span class="hl std">(</span><span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">177</span><span class="hl std">,</span> <span class="hl num">182</span><span class="hl std">))),</span> <span class="hl str">&quot;latin2&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;utf-8&quot;</span><span class="hl std">)</span>
<span class="hl std">)</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 261 347
</pre></div>
</div></div>


</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_encode()</code></code> 
   provides a very similar functionality
   as <code><code class="knitr inline">iconv()</code></code>.
   
   <p>Note that currently used default encoding may be obtained by calling
   <code><code class="knitr inline">stri_enc_get()</code></code>
   and changed any time with a call to <code><code class="knitr inline">stri_enc_set()</code></code>.
   This is not dangerous as almost every function in stringi
   returns UTF-8-encoded strings.</p>
   
   <p><code><code class="knitr inline">stri_encode()</code></code> and
   <code><code class="knitr inline">iconv()</code></code> differ in the treatment of
   unsupported characters. If an incorrect code point is found on input,
   <code><code class="knitr inline">stri_encode()</code></code>  replaces it by the default
   (for that target encoding) substitute character and generates a warning.
   <code><code class="knitr inline">iconv()</code></code> in turn, by default silently returns
   <code>NA</code>.

<div class="chunk" id="unnamed-chunk-19"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl kwd">stri_enc_toutf32</span><span class="hl std">(</span>
   <span class="hl kwd">stri_encode</span><span class="hl std">(</span><span class="hl kwd">rawToChar</span><span class="hl std">(</span><span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl kwd">c</span><span class="hl std">(</span><span class="hl num">177</span><span class="hl std">,</span> <span class="hl num">182</span><span class="hl std">))),</span> <span class="hl str">&quot;latin2&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;utf-8&quot;</span><span class="hl std">)</span>
<span class="hl std">)[[</span><span class="hl num">1</span><span class="hl std">]]</span>
</pre></div>
<div class="output"><pre class="knitr r">## [1] 261 347
</pre></div>
</div></div>


</div>
</div>


<h3>Performance comparison</h3>

<div class="chunk" id="unnamed-chunk-20"><div class="rcode"><div class="source"><pre class="knitr r"><span class="hl std">test1</span> <span class="hl kwb">&lt;-</span> <span class="hl kwd">as.raw</span><span class="hl std">(</span><span class="hl num">128</span><span class="hl opt">:</span><span class="hl num">255</span><span class="hl std">)</span>
<span class="hl kwd">microbenchmark</span><span class="hl std">(</span><span class="hl kwd">iconv</span><span class="hl std">(test1,</span> <span class="hl str">&quot;latin2&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;utf8&quot;</span><span class="hl std">),</span> <span class="hl kwd">stri_encode</span><span class="hl std">(test1,</span> <span class="hl str">&quot;latin2&quot;</span><span class="hl std">,</span> <span class="hl str">&quot;utf8&quot;</span><span class="hl std">))</span>
</pre></div>
<div class="output"><pre class="knitr r">## Unit: microseconds
##                                  expr    min      lq  median      uq     max neval
##        iconv(test1, "latin2", "utf8") 53.614 54.9265 55.8775 57.6275 113.434   100
##  stri_encode(test1, "latin2", "utf8")  7.470  8.1910  9.3175 10.4570  81.452   100
</pre></div>
</div></div>



<!-- -------------------------------------------------------------- -->

<h2>Unicode Normalization</h2>

TODO: this is text transform

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>

<em>(none)</em>

</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_trans_isnfc(), stri_trans_isnfkc(), stri_trans_isnfd(), stri_trans_isnfkd(), stri_trans_isnfkc_casefold()</code></code>
   check whether given UTF-8-encoded strings are properly normalized.
   
   <p>Moreover, <code><code class="knitr inline">stri_trans_nfc(), stri_trans_nfkc(), stri_trans_nfd(), stri_trans_nfkd(), stri_trans_nfkc_casefold()</code></code>
   perform the desired normalization.</p>
</div>
</div>

<!-- -------------------------------------------------------------- -->

<h2>Automatic Encoding Detection</h2>

<h3>Basic Functionality</h3>

<div class='columns'>
<div class='column1'><div class='columntitle'>base</div>

<em>(none)</em>

</div>
<div class='column2'><div class='columntitle'>stringr</div>

<em>(none)</em>

</div>
<div class='column3'><div class='columntitle'>stringi</div>
   <code><code class="knitr inline">stri_enc_detect()</code></code>
   and <code><code class="knitr inline">stri_enc_detect2()</code></code>
   provide two experimental facilities for automatic encoding detection.
   The first one uses ICU's native algorithm and the second one
   provides our own implementation for locale-dependent guessing.
   
   <p>TO DO: <code><code class="knitr inline">stri_enc_detect2()</code></code> - choose best
   match from a given list of guesses.</p>
   
   <p>Moreover, the functions
   <code><code class="knitr inline">stri_enc_isascii(), stri_enc_utf8(), str_enci_isutf16le(), stri_enc_isutf16le(), stri_enc_isutf32le(), stri_enc_isutf32le()</code></code>
   check whether given byte sequences
   form a valid character sequence in a given encoding.</p>
</div>
</div>


<!-- -------------------------------------------------------------- -->

<div class="footer"><a rel="license" href="http://creativecommons.org/licenses/by/3.0/"><img alt="Creative Commons License" style="border-width:0; float: left; margin: 8px" src="http://i.creativecommons.org/l/by/3.0/88x31.png"></a><a href="http://www.rexamine.com/">
<img style="float: right; margin: 8px"
src="http://static.rexamine.com/img/Rexamine_logo_transparent3.png" title="Rexamine" alt="Rexamine" /></a><div style="text-align: center">Licensed under the
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative
Commons Attribution 3.0 Unported License</a>.<br />Copyleft 2013-2014, <a href="http://gagolewski.rexamine.com">Marek Gagolewski</a>.<br />Last updated:  2014-04-19 12:08:02 </div></div>


</body>
</html>
